Conversation
…tead of logic operations when computing the neighboring index; this is branch-free and uses fewer operations
…quarter precision support
…for executing single-thread regions of code. On CUDA, install the latest version of CCCL via CPM, since we need some new features
…slash kernels. Disabled by default (set with the Arg::prefetch_distance parameter); TMA prefetch will be added in a subsequent push
…ith QUDA_DSLASH_PREFETCH_BULK=ON). Prefetch distance is now set via CMake (QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED)
…ble on CUDA platform
…ants of vector_load and vector_store: these allow the pointer offset and the index to be computed together first in 32-bit, before accumulation onto the pointer in 64-bit, reducing pointer-arithmetic overheads
…d and vector_store to reduce indexing overheads
…tNOrder uses optimized 3-operand indexing
TMA (Tensor Memory Accelerator) is only available on Hopper (sm_90) and later architectures. This commit wraps the cuTensorMapEncodeTiled calls with a compile-time guard to prevent runtime errors on Volta/Ampere GPUs.
…, placing the end face (which is otherwise lost) into the ghost
…the gauge field from the ghost region - ensures coalesced access regardless of partitioning
… - comms partitioning was effectively disabled for testing
…ow created unless TENSOR prefetching type is enabled
cscs-ci run
Is performance on AMD regularly benchmarked "officially"? If so, what is being benchmarked? After recent updates on Lumi-G I had to update our production stack. I was no longer able to compile the head commit of the develop branch (related to what is observed in #1617, I think), nor the (now very old) commit that we used on Lumi-G previously (6198d60), which I couldn't fix by trying to backport the changes to quda::complex. We figured out that the feature/prefetch2 branch compiles, but I observe substantial performance regressions in our tmLQCD+QUDA HMC compared to our production setup, which was running until December 2025:
Overall this leads to a factor > 2 increase in time per trajectory, unfortunately. I'm unable to pin down what is responsible, as we had to update from rocm-5.6.1 (very old, I know, but that was what was available on Lumi-G at the time) to rocm-6.3.4 or rocm-6.4.4 AND make a very large jump in QUDA version.
|
@kostrzewa thanks for the report on where things stand on ROCm. I think the issue with compilation should be fixed with … Regarding the performance regression, do you happen to have a tune cache to hand for before and after? That would help guide us as to where the regression is. I suspect the issue is a compiler-driven regression in the dslash performance, but it could also be changes in QUDA itself. Since that old version of QUDA, one of the biggest changes has been that the default data ordering changed, to what I call "maximal vectorization". What this means is, for example, we previously would have used …
…with shifting (can't shift a shifted field), and fix move constructor so that shift field is moved
…g some bug hunting" This reverts commit 8c7ba4d.
Ah, I always forget to look at the tunecaches. Yes, please find them attached here: quda_amd_perf_regression.tar.gz. The directory names in the archive should be reasonably self-explanatory. Looking at some of the kernels in profile_async_0.tsv seems to confirm my observations from the tmLQCD-internal timers w.r.t. the MG as well as the ndeg twisted clover half precision kernels:
I'll try this right away, thanks! Going back to the legacy order helps a little. The situation is subtle: on a 32c64 lattice on 2 nodes (16 GCDs), with the prefetch2 branch and legacy order I actually see a slight overall performance improvement with CrayEnv_gnu_rocm_644 over the old commit with gnu_env_23_09_rocm_561. On 28 nodes on a 112c224 lattice instead I see, as an example:

* rocm-561 / 6198d60
* rocm-644 / prefetch2 3c8ed1a / defaults
* rocm-644 / prefetch2 3c8ed1a / legacy order

Note that these inversions have identical starting conditions. I guess it's the autotuning which causes the iteration numbers to differ a little. The main point is the time per iteration, though, and the reported performance. Sorry for polluting the discussion here with so much stuff; I guess I should have opened a new issue for this...
This work is the latest towards optimizing QUDA for Blackwell:
* …`vector_load`. At present, not deployed anywhere.
* `QUDA_DSLASH_PREFETCH` CMake parameter, with 0 = per-thread, 1 = TMA bulk, and 2 = TMA descriptor
* `target::is_thread_zero()`, which should be used for TMA issuance.
* `QUDA_DSLASH_DOUBLE_STORE=ON`, which is required for TMA-based prefetching (for alignment reasons).
* Prefetching is exposed for both ColorSpinorFields and GaugeFields, though only the latter is actually used at present.
* `QUDA_DSLASH_PREFETCH_DISTANCE_WILSON` and `QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED` CMake parameters.
* `vector_load` and `vector_store` to this end (respectively).
* …`int` with division by `fast_intdiv`)

The end result of this work is that both the Staggered and Wilson dslash kernels can saturate over 90% of memory bandwidth for most variants. Outstanding are the half precision variants using reconstruction, which are still lagging; these will be the focus of a subsequent PR.